Single-Layer Networks: Regression

Bio-statistical Learning

Santiago Alférez

Introduction

  • This chapter explores basic neural network concepts via linear regression.
  • Linear regression models represent a simple, single-layer neural network.
  • Why start here?
    • Though of limited practical use on their own, they possess simple analytical properties.
    • Excellent for introducing core concepts fundamental to deep neural networks.
  • Goal for today: Understand the building blocks before moving to complex architectures.

1. Linear Regression: The Goal

  • Regression Task: Predict one or more continuous target variables \(t\) given a \(D\)-dimensional input vector \(\mathbf{x}\).
  • Training Data: We are given \(N\) observations \(\{\mathbf{x}_n\}\) and corresponding target values \(\{t_n\}\).
  • Model: We formulate a function \(y(\mathbf{x}, \mathbf{w})\) that makes predictions.
    • \(\mathbf{w}\) represents a vector of learnable parameters.
  • Simplest Model: A linear combination of input variables: \[y(\mathbf{x},\mathbf{w}) = w_0 + w_1x_1 + \dots + w_Dx_D \quad (1)\]
    • \(\mathbf{x}=(x_{1},...,x_{D})^{T}\). \(w_0\) is the bias, \(w_1, \dots, w_D\) are weights.
    • Key Property: Linear function of parameters \(w_j\).
    • Limitation: Also a linear function of input variables \(x_i\), restricting its ability to model complex relationships.

1.1 Basis Functions: Adding Non-linearity

  • To address the limits of simple linear regression, introduce basis functions.
  • The model becomes a sum of nonlinear functions of input variables: \[y(\mathbf{x},\mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j\phi_j(\mathbf{x})\]
    • \(\phi_j(\mathbf{x})\): basis functions.
    • \(M\) parameters (\(M-1\) basis + bias \(w_0\)).
  • Often use a dummy basis \(\phi_0(\mathbf{x}) = 1\): \[y(\mathbf{x},\mathbf{w}) = \sum_{j=0}^{M-1} w_j\phi_j(\mathbf{x}) = \mathbf{w}^T\phi(\mathbf{x})\]
    • \(\mathbf{w} = (w_0, \dots, w_{M-1})^T\)
    • \(\phi(\mathbf{x}) = (\phi_0(\mathbf{x}), \dots, \phi_{M-1}(\mathbf{x}))^T\)
  • Linear in \(\mathbf{w}\): \(y(\mathbf{x},\mathbf{w})\) can be nonlinear in \(\mathbf{x}\), but model remains linear in the parameters.

Visualizing Linear Regression with Basis Functions

Network Diagram

Figure 1: Linear regression model as a single-layer network. Each \(\phi_j(\mathbf{x})\) is an input, \(w_j\) are weights, \(y(\mathbf{x},\mathbf{w})\) is the output. The solid blue node represents the bias \(\phi_0(\mathbf{x})=1\).

Role of Basis Functions

  • Before deep learning, feature extraction (choosing good \(\phi_j(\mathbf{x})\)) was crucial.
  • Deep learning aims to learn these transformations from data.

Examples of Basis Functions

  • Polynomials: \(\phi_j(x) = x^j\). Applied to a scalar input \(x\) or to individual components of \(\mathbf{x}\).
  • Gaussian: \(\phi_j(x) = \exp\left\{-\frac{(x-\mu_j)^2}{2s^2}\right\}\), where \(\mu_j\) is the center and \(s\) the width. For vector \(\mathbf{x}\), the form is \(\exp\left\{-\frac{\|\mathbf{x}-\boldsymbol{\mu}_j\|^2}{2s^2}\right\}\).
  • Sigmoidal: \(\phi_j(x) = \sigma\left(\frac{x-\mu_j}{s}\right)\), where \(\sigma(a) = \frac{1}{1+\exp(-a)}\). For vector \(\mathbf{x}\), the argument is typically a linear projection such as \(\mathbf{v}^T\mathbf{x} + v_0\).
  • While the formulas are shown for scalar \(x\) for simplicity (as in Figure 2), in a \(D\)-dimensional input space these functions operate on \(\mathbf{x}\) or its components.
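As a concrete sketch, the three families above can be evaluated with NumPy (the centers, widths, and input grid below are illustrative choices, not values from the slides):

```python
import numpy as np

def polynomial_basis(x, M):
    # phi_j(x) = x^j for j = 0..M-1 (phi_0 = 1 plays the role of the bias term)
    return np.stack([x**j for j in range(M)], axis=-1)

def gaussian_basis(x, centers, s):
    # phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2))
    return np.exp(-(x[:, None] - centers[None, :])**2 / (2 * s**2))

def sigmoid_basis(x, centers, s):
    # phi_j(x) = sigma((x - mu_j) / s), with sigma(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-(x[:, None] - centers[None, :]) / s))

x = np.linspace(-1, 1, 5)
centers = np.array([-0.5, 0.0, 0.5])
print(polynomial_basis(x, 3).shape)        # (5, 3): one row per input, one column per basis
print(gaussian_basis(x, centers, 0.5).shape)
```

Each function returns one row per input point and one column per basis function, which is exactly the layout of the design matrix \(\Phi\) introduced below.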

Visualizing Basis Functions (1D Example)

a. Polynomials (\(x^j\))

b. Gaussians (\(\exp(-(x-\mu_j)^2/2s^2)\))

c. Sigmoids (\(\sigma((x-\mu_j)/s)\))

Figure 2: Examples of basis functions plotted against a single variable \(x\).

  • For now, our discussion is largely independent of the specific choice of basis functions \(\phi_j(\mathbf{x})\).
  • We’ll focus on a single target variable \(t\) for simplicity.

1.2 Likelihood Function: Probabilistic View

  • Assume target variable \(t\) is the model prediction \(y(\mathbf{x},\mathbf{w})\) plus additive Gaussian noise \(\epsilon\): \[t = y(\mathbf{x},\mathbf{w}) + \epsilon \quad (7)\]
    • \(\epsilon \sim \mathcal{N}(0, \sigma^2)\) (zero-mean Gaussian noise with variance \(\sigma^2\)).
  • This implies a conditional probability distribution for \(t\): \[p(t|\mathbf{x},\mathbf{w},\sigma^2) = \mathcal{N}(t|y(\mathbf{x},\mathbf{w}), \sigma^2) \quad (8)\]
  • Given a dataset \(X = \{\mathbf{x}_1, \dots, \mathbf{x}_N\}\) and targets \(\mathbf{t} = \{t_1, \dots, t_N\}\), assuming data points are drawn independently: Likelihood Function: \[p(\mathbf{t}|X,\mathbf{w},\sigma^2) = \prod_{n=1}^{N} \mathcal{N}(t_n|\mathbf{w}^T\phi(\mathbf{x}_n), \sigma^2) \quad (9)\]

Log-Likelihood and Error Function

  • It’s often easier to work with the log-likelihood: \[\ln p(\mathbf{t}|X,\mathbf{w},\sigma^2) = \sum_{n=1}^{N} \ln \mathcal{N}(t_n|\mathbf{w}^T\phi(\mathbf{x}_n), \sigma^2)\]

  • Using the form of a univariate Gaussian: \[\ln p(\mathbf{t}|X,\mathbf{w},\sigma^2) = -\frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2 \quad (10)\]

  • Let’s define the sum-of-squares error function: \[E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2 \quad (11)\]

  • Substituting (11) into (10): \[\ln p(\mathbf{t}|X,\mathbf{w},\sigma^2) = -\frac{N}{2}\ln\sigma^2 - \frac{N}{2}\ln(2\pi) - \frac{1}{\sigma^2}E_D(\mathbf{w}) \quad (12)\]

  • Key Insight: Maximizing the log-likelihood (w.r.t. \(\mathbf{w}\)) under a Gaussian noise assumption is equivalent to minimizing the sum-of-squares error \(E_D(\mathbf{w})\).

1.3 Maximum Likelihood Solution for \(\mathbf{w}\)

  • To find \(\mathbf{w}_{ML}\), we maximize \(\ln p(\mathbf{t}|X,\mathbf{w},\sigma^2)\) w.r.t. \(\mathbf{w}\).
  • This is equivalent to minimizing \(E_D(\mathbf{w})\). Its gradient w.r.t. \(\mathbf{w}\) is: \[\nabla_{\mathbf{w}} E_D(\mathbf{w}) = -\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}\phi(\mathbf{x}_n)\] (Here \(\phi(\mathbf{x}_n)\) is the \(M\)-dimensional column vector of basis-function outputs, so the gradient is itself an \(M\)-dimensional vector.)
  • Setting the gradient to zero gives: \[0 = \sum_{n=1}^{N}t_n\phi(\mathbf{x}_n)^T - \mathbf{w}^T\left(\sum_{n=1}^{N}\phi(\mathbf{x}_n)\phi(\mathbf{x}_n)^T\right)\]
  • Solving for \(\mathbf{w}_{ML}\): \[\mathbf{w}_{ML} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t} \quad (15)\]
    • This is the solution of the normal equations for the least-squares problem. \(\mathbf{t}\) is the vector of target values.

Design Matrix and Pseudo-Inverse

  • In \(\mathbf{w}_{ML} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t}\):
    • \(\mathbf{t}\) is the column vector \((t_1, \dots, t_N)^T\).
    • \(\Phi\) is the \(N \times M\) design matrix: \[\Phi = \begin{pmatrix} \phi_0(\mathbf{x}_1) & \phi_1(\mathbf{x}_1) & \dots & \phi_{M-1}(\mathbf{x}_1) \\ \phi_0(\mathbf{x}_2) & \phi_1(\mathbf{x}_2) & \dots & \phi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \phi_0(\mathbf{x}_N) & \phi_1(\mathbf{x}_N) & \dots & \phi_{M-1}(\mathbf{x}_N) \end{pmatrix} \quad (16)\] (Each \(\phi_j(\mathbf{x}_n)\) is a scalar output of the \(j\)-th basis function for the \(n\)-th input vector)
  • The term \(\Phi^\dagger \equiv (\Phi^T\Phi)^{-1}\Phi^T\) is the Moore-Penrose pseudo-inverse of \(\Phi\). (17)
    • So, \(\mathbf{w}_{ML} = \Phi^\dagger \mathbf{t}\).
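A minimal NumPy sketch of the pseudo-inverse solution on synthetic data (the generating function and noise level are illustrative assumptions, not values from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1D data: t = 0.5 + 2x + Gaussian noise (illustrative values)
N = 200
x = rng.uniform(-1, 1, N)
t = 0.5 + 2.0 * x + rng.normal(0, 0.1, N)

# Design matrix with polynomial basis phi_j(x) = x^j and M = 2 (bias + slope)
Phi = np.stack([np.ones(N), x], axis=1)       # shape (N, M)

# Normal-equations solution via the Moore-Penrose pseudo-inverse, eq. (15)/(17)
w_ml = np.linalg.pinv(Phi) @ t                # = (Phi^T Phi)^{-1} Phi^T t

# ML estimate of the noise variance, eq. (21): mean squared residual
sigma2_ml = np.mean((t - Phi @ w_ml)**2)
print(w_ml, sigma2_ml)
```

With enough data the recovered weights approach the generating values and \(\sigma_{ML}^2\) approaches the true noise variance. In production code, `np.linalg.lstsq` is usually preferred over forming the pseudo-inverse explicitly.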

Role of the Bias Parameter \(w_0\)

  • Error function with explicit \(w_0\) (where \(\mathbf{w}\) now excludes \(w_0\)): \[E_D(\mathbf{w}) = \frac{1}{2}\sum_{n=1}^{N}\left\{t_n - w_0 - \sum_{j=1}^{M-1}w_j\phi_j(\mathbf{x}_n)\right\}^2 \quad (18)\]

  • Setting \(\frac{\partial E_D}{\partial w_0} = 0\) and solving for \(w_0\): \[w_0 = \bar{t} - \sum_{j=1}^{M-1}w_j\bar{\phi}_j \quad (19)\] where \(\bar{t} = \frac{1}{N}\sum_{n=1}^{N}t_n\) and \(\bar{\phi}_j = \frac{1}{N}\sum_{n=1}^{N}\phi_j(\mathbf{x}_n)\). (20)

  • Interpretation: \(w_0\) compensates for the difference between the average of the targets \(\bar{t}\) and the weighted sum of the averages of the basis functions.

Maximum Likelihood Solution for \(\sigma^2\)

  • Maximize log-likelihood (12) w.r.t. \(\sigma^2\): \[\sigma_{ML}^2 = \frac{1}{N}\sum_{n=1}^{N}\{t_n - \mathbf{w}_{ML}^T\phi(\mathbf{x}_n)\}^2 \quad (21)\]
  • Interpretation: \(\sigma_{ML}^2\) is the mean squared residual of the fitted regression function, i.e. the noise variance estimated from the training data.

1.4 Geometry of Least Squares

Geometric View

  • N-dim space: axes \(t_n\). \(\mathbf{t} = (t_1, \dots, t_N)^T\).
  • Basis vectors \(\boldsymbol{\varphi}_j = (\phi_j(\mathbf{x}_1), \dots, \phi_j(\mathbf{x}_N))^T\). (Each \(\boldsymbol{\varphi}_j\) is a column in \(\Phi\))
  • These span an \(M\)-dim subspace \(S\) (if \(M<N\)).
  • Prediction \(\mathbf{y}\) (elements \(y(\mathbf{x}_n, \mathbf{w})\)) lies in \(S\).
  • \(E_D(\mathbf{w}) = \frac{1}{2}||\mathbf{y} - \mathbf{t}||^2\).
  • Minimizing \(E_D(\mathbf{w})\) finds \(\mathbf{y} \in S\) closest to \(\mathbf{t}\).
  • Solution: \(\mathbf{y}\) is the orthogonal projection of \(\mathbf{t}\) onto \(S\).
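This orthogonality can be checked numerically: the least-squares residual \(\mathbf{t}-\mathbf{y}\) is perpendicular to every column \(\boldsymbol{\varphi}_j\) of \(\Phi\). A small sketch with a random design matrix (purely illustrative synthetic data):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 50, 4
Phi = rng.normal(size=(N, M))          # columns play the role of the basis vectors varphi_j
t = rng.normal(size=N)

w = np.linalg.pinv(Phi) @ t            # least-squares solution
y = Phi @ w                            # projection of t onto S = span(varphi_j)

# The residual t - y is orthogonal to every column of Phi (up to round-off)
print(np.abs(Phi.T @ (t - y)).max())
```

The printed value is numerically zero, confirming that \(\mathbf{y}\) is the orthogonal projection of \(\mathbf{t}\) onto \(S\).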

Diagram

Figure 3: The vector \(\mathbf{t}\) is projected onto the subspace \(S\) spanned by basis function vectors \(\boldsymbol{\varphi}_j\). The projection is \(\mathbf{y}\).

Numerical Issues: If \(\Phi^T\Phi\) is singular or ill-conditioned, consider using SVD or adding regularization to improve numerical stability.

1.5 Sequential Learning (Online Algorithms)

  • \(\mathbf{w}_{ML} = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t}\) is a batch method.
    • Costly for large \(N\)
  • Sequential (online) algorithms: Process data points one at a time.
    • Good for large datasets / real-time.
  • Stochastic Gradient Descent (SGD): If error \(E = \sum_n E_n\), update after point \(n\): \[\boxed{\color{blue}{\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla E_n \quad (22)}}\]
    • \(\mathbf{w}^{(\tau)}\): params at iteration \(\tau\).
    • \(\eta\): learning rate.
    • \(\nabla E_n\): gradient for \(n\)-th data point.

LMS Algorithm (Least Mean Squares)

  • For sum-of-squares error, \(E_n = \frac{1}{2}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2\).
  • \(\nabla E_n = -(t_n - \mathbf{w}^T\phi(\mathbf{x}_n))\phi(\mathbf{x}_n)\). (Here \(\phi(\mathbf{x}_n)\) is the vector of basis function outputs for \(\mathbf{x}_n\))
  • SGD update rule (22) becomes: \[\mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} + \eta(t_n - \mathbf{w}^{(\tau)T}\phi(\mathbf{x}_n))\phi(\mathbf{x}_n) \quad (23)\]
    • Known as Least-Mean-Squares (LMS) or Widrow-Hoff rule.
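A sketch of the LMS update (23) on synthetic data; the generating weights, learning rate, and epoch count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
x = rng.uniform(-1, 1, N)
t = 1.0 - 3.0 * x + rng.normal(0, 0.1, N)   # targets from an illustrative linear function
Phi = np.stack([np.ones(N), x], axis=1)     # phi(x_n) = (1, x_n)^T

w = np.zeros(2)
eta = 0.05                                   # learning rate (hand-picked)
for epoch in range(30):
    for n in rng.permutation(N):
        phi_n = Phi[n]
        # Widrow-Hoff / LMS update, eq. (23)
        w = w + eta * (t[n] - w @ phi_n) * phi_n
print(w)   # approaches the batch least-squares solution
```

Unlike the batch solution, each update touches a single data point, so memory cost is independent of \(N\); the price is that \(\eta\) must be tuned for stable convergence.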

1.6 Regularized Least Squares

  • Regularization adds a penalty to control overfitting.
  • Total error function: \[E_{total}(\mathbf{w}) = E_D(\mathbf{w}) + \lambda E_W(\mathbf{w}) \quad (24)\]
    • \(E_D(\mathbf{w})\): data error.
    • \(E_W(\mathbf{w})\): regularization term.
    • \(\lambda\): regularization coefficient controlling the relative importance of the two terms.
  • Common regularizer: L2 regularization (ridge regression): \[E_W(\mathbf{w}) = \frac{1}{2}\sum_{j=0}^{M-1} w_j^2 = \frac{1}{2}\mathbf{w}^T \mathbf{w} \quad (25)\]

Solution for Regularized Least Squares

  • Total error with L2 regularization: \[\frac{1}{2}\sum_{n=1}^{N}\{t_n - \mathbf{w}^T\phi(\mathbf{x}_n)\}^2 + \frac{\lambda}{2}\mathbf{w}^T \mathbf{w} \quad (26)\]
  • Set gradient w.r.t. \(\mathbf{w}\) to zero: \[(\Phi^T\Phi + \lambda I)\mathbf{w} = \Phi^T\mathbf{t}\]
  • Solution: \[\mathbf{w} = (\lambda I + \Phi^T\Phi)^{-1}\Phi^T\mathbf{t} \quad (27)\]
    • \(I\) is the \(M \times M\) identity matrix. For \(\lambda > 0\), \(\lambda I + \Phi^T\Phi\) is positive definite and hence invertible, avoiding the singularity issues of \(\Phi^T\Phi\).
    • Shrinks weights towards zero.
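A sketch of the regularized solution (27), fitting a deliberately flexible polynomial model to noisy synthetic data (the basis, noise level, and \(\lambda\) values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
N, M = 30, 8
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)   # illustrative noisy targets

# Flexible polynomial design matrix: phi_j(x) = x^j
Phi = np.stack([x**j for j in range(M)], axis=1)

def ridge(Phi, t, lam):
    # Regularized least squares, eq. (27): w = (lambda I + Phi^T Phi)^{-1} Phi^T t
    return np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)

w_unreg = ridge(Phi, t, 0.0)
w_reg = ridge(Phi, t, 1e-2)
# The penalty shrinks the weight vector towards zero
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))
```

The norm of the regularized weight vector is smaller, illustrating the shrinkage effect; the ridge solution is also numerically better behaved when \(\Phi^T\Phi\) is ill-conditioned.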

1.7 Multiple Outputs

  • To predict \(K > 1\) target variables \(\mathbf{t} = (t_1, \dots, t_K)^T\).
  • Use same basis functions \(\phi(\mathbf{x})\) for all \(K\) outputs: \[\mathbf{y}(\mathbf{x},W) = W^T\phi(\mathbf{x}) \quad (28)\]
    • \(\mathbf{y}(\mathbf{x},W)\): \(K\)-dim vector.
    • \(W\): \(M \times K\) parameter matrix. (Each column is a weight vector \(\mathbf{w}_k\))
    • \(\phi(\mathbf{x})\): \(M\)-dim basis vector.

Network for Multiple Outputs

Figure 4: Linear regression for multiple outputs \(y_1, \dots, y_K\). Each output \(y_k\) is a linear combination of the basis functions \(\phi_j(\mathbf{x})\) with its own set of weights (a column in \(W\)).

Likelihood for Multiple Outputs

  • Assume isotropic Gaussian conditional distribution: \[p(\mathbf{t}|\mathbf{x},W,\sigma^2) = \mathcal{N}(\mathbf{t}|W^T\phi(\mathbf{x}), \sigma^2I) \quad (29)\]
  • Log-likelihood for \(N\) observations (targets \(T\) as \(N \times K\) matrix): \[\ln p(T|X,W,\sigma^2) = \sum_{n=1}^{N} \ln \mathcal{N}(\mathbf{t}_n|W^T\phi(\mathbf{x}_n), \sigma^2I)\] \[= -\frac{NK}{2}\ln(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{n=1}^{N}||\mathbf{t}_n - W^T\phi(\mathbf{x}_n)||^2 \quad (30)\]

ML Solution for Multiple Outputs

  • Maximizing log-likelihood (30) w.r.t. \(W\): \[W_{ML} = (\Phi^T\Phi)^{-1}\Phi^T T \quad (31)\]
  • For each column \(\mathbf{w}_k\) of \(W_{ML}\) (params for \(k\)-th output) and \(\mathbf{t}_k\) (vector of \(N\) targets for \(k\)-th output) of \(T\): \[\mathbf{w}_k = (\Phi^T\Phi)^{-1}\Phi^T\mathbf{t}_k = \Phi^\dagger\mathbf{t}_k \quad (32)\]
  • Key Result: Regression decouples for each target. Pseudo-inverse \(\Phi^\dagger\) is shared.
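The decoupling can be verified numerically; in this sketch (synthetic data, illustrative sizes), the pseudo-inverse is computed once and reused for every output column:

```python
import numpy as np

rng = np.random.default_rng(4)
N, M, K = 100, 3, 2
Phi = rng.normal(size=(N, M))                          # design matrix (synthetic features)
W_true = rng.normal(size=(M, K))
T = Phi @ W_true + rng.normal(0, 0.05, size=(N, K))    # N x K target matrix

Phi_pinv = np.linalg.pinv(Phi)    # shared pseudo-inverse, computed once
W_ml = Phi_pinv @ T               # eq. (31): all K outputs solved together

# Identical to fitting each output independently, eq. (32)
w_col0 = Phi_pinv @ T[:, 0]
print(np.allclose(W_ml[:, 0], w_col0))
```

Solving all outputs in one matrix product is exactly equivalent to \(K\) independent regressions, which is the "decoupling" noted above.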

2. Decision Theory: Making Predictions

  • We’ve learned to model \(p(t|\mathbf{x})\), e.g., \(\mathcal{N}(t|y(\mathbf{x},\mathbf{w}_{ML}), \sigma_{ML}^2)\). This is our predictive distribution.
  • But often, we need to make a single, concrete prediction \(f(\mathbf{x})\).
    • Analogy: A weather model gives a 70% chance of rain (\(p(t|\mathbf{x})\)). You need to decide: “Take an umbrella” or “Leave it” (\(f(\mathbf{x})\)).
  • Two Stages:
    1. Inference Stage: Learn \(p(t|\mathbf{x})\) from training data. (We’ve done this!)
    2. Decision Stage: Choose an optimal prediction \(f(\mathbf{x})\) using \(p(t|\mathbf{x})\) and a loss function \(L(t, f(\mathbf{x}))\).
      • A loss function measures the “cost” or “error” if the true value is \(t\) and we predict \(f(\mathbf{x})\).

What’s a “Good” Prediction? Expected Loss

  • How do we measure how “wrong” our prediction \(f(\mathbf{x})\) is compared to the true value \(t\)?
  • A very common way for regression is the squared loss: \[L(t, f(\mathbf{x})) = \{f(\mathbf{x})-t\}^2\]
    • Why squared? It punishes larger errors much more than small errors. It’s also mathematically convenient!
  • We want to choose \(f(\mathbf{x})\) that minimizes the average or expected loss over all possible values of \(\mathbf{x}\) and \(t\): \[\mathbb{E}[L] = \iint \{f(\mathbf{x})-t\}^2 p(\mathbf{x},t)d\mathbf{x} dt \quad (35)\]
    • \(p(\mathbf{x},t)\) is the true joint probability of \(\mathbf{x}\) and \(t\).
  • Our goal: Find \(f(\mathbf{x})\) that makes \(\mathbb{E}[L]\) as small as possible.

Optimal Prediction for Squared Loss

  • If we use the squared loss \(L(t,f(\mathbf{x})) = \{f(\mathbf{x})-t\}^2\):
  • The prediction \(f^*(\mathbf{x})\) that minimizes the expected squared loss \(\mathbb{E}[L]\) is the conditional mean of \(t\) given \(\mathbf{x}\): \[\boxed{f^*(\mathbf{x}) = \mathbb{E}_t[t|\mathbf{x}] = \int t\, p(t|\mathbf{x})\, dt} \quad (37)\]
    • This means: for any given input \(\mathbf{x}\), the “best” prediction is the average of all possible true target values \(t\) that could occur for that \(\mathbf{x}\).
    • This \(f^*(\mathbf{x})\) is often called the regression function.
  • For our Gaussian model: \(p(t|\mathbf{x}) = \mathcal{N}(t|y(\mathbf{x},\mathbf{w}), \sigma^2)\).
    • The conditional mean is simply: \(\mathbb{E}_t[t|\mathbf{x}] = y(\mathbf{x},\mathbf{w}) \quad (38)\)
    • So, the optimal prediction is our model’s output \(y(\mathbf{x},\mathbf{w}_{ML})\)!
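This result can be illustrated empirically: for samples of \(t\) drawn from any distribution, the constant prediction that minimizes the average squared loss is the sample mean. A sketch using an arbitrarily chosen skewed distribution:

```python
import numpy as np

rng = np.random.default_rng(5)
# Samples of t for one fixed input x (a skewed distribution, chosen arbitrarily)
t = rng.gamma(shape=2.0, scale=1.5, size=10_000)

# Average squared loss for a grid of candidate predictions f
f_grid = np.linspace(0.0, 10.0, 501)
loss = ((f_grid[:, None] - t[None, :])**2).mean(axis=1)

f_best = f_grid[np.argmin(loss)]
print(f_best, t.mean())   # the empirical minimizer sits at the sample mean
```

Note that the distribution here is asymmetric, so the minimizer is the mean rather than the mode or the median; a different loss (e.g. absolute error) would pick a different summary of \(p(t|\mathbf{x})\).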

Figure 5: The optimal prediction \(f^*(x)\) (red curve) is the mean of the conditional distribution \(p(t|x_0)\) (blue curve) for each input \(x_0\). (Shown for scalar \(x\) for visualization)

Decomposing the Expected Squared Loss

  • Let’s look closer at the expected squared loss (eq. 35). If we make the prediction \(f(\mathbf{x})\), the expected loss can be broken down: \[\mathbb{E}[L] = \underbrace{\int \{f(\mathbf{x})-\mathbb{E}[t|\mathbf{x}]\}^2 p(\mathbf{x})d\mathbf{x}}_{\text{Term 1: Our Model's Contribution to Error}} + \underbrace{\int \text{var}[t|\mathbf{x}] p(\mathbf{x})d\mathbf{x}}_{\text{Term 2: Irreducible Error (Noise)}} \quad (\text{derived from } 39)\]
  • Term 1: Our Model’s Contribution
    • This part depends on how different our prediction \(f(\mathbf{x})\) is from the ideal prediction \(\mathbb{E}[t|\mathbf{x}]\) (the true conditional mean).
    • If we choose \(f(\mathbf{x}) = \mathbb{E}[t|\mathbf{x}]\) (the optimal prediction), this term becomes zero!
  • Term 2: Irreducible Error (Noise)
    • \(\text{var}[t|\mathbf{x}]\) is the variance of \(t\) given \(\mathbf{x}\). It’s the inherent randomness or “noise” in the data that our model cannot predict.
    • This term represents the minimum achievable expected loss, even with a perfect model that knows \(\mathbb{E}[t|\mathbf{x}]\).
    • This is the fundamental limit due to the nature of the data itself.

3. The Bias-Variance Trade-off: The Challenge

  • Goal: We want models that predict well on new, unseen data (generalization).
  • But models can be tricky!
    • Too simple? Might miss the true pattern (underfitting).
    • Too complex? Might fit the training noise, not the pattern (overfitting).
  • The Bias-Variance Trade-off helps us understand this.
  • Analogy: The Archer
    • High Bias Archer: Consistently misses the bullseye in the same direction (e.g., always high and left). Their average shot is off. (Model is too simple, systematically wrong).
    • High Variance Archer: Shots are scattered all around the target. No consistent miss, but widely spread. (Model is too sensitive, predictions change wildly with different data).
    • Good Archer: Hits near the bullseye, consistently (Low Bias, Low Variance). This is our goal!

Understanding Prediction Errors: Bias and Variance

  • Let \(h(\mathbf{x}) = \mathbb{E}[t|\mathbf{x}]\) be the true, optimal regression function (the bullseye).
  • Our model \(f(\mathbf{x}; \mathcal{D})\) is trained on a specific dataset \(\mathcal{D}\).
  • If we could average over many possible datasets \(\mathcal{D}\), the expected squared difference between our model’s prediction \(f(\mathbf{x}; \mathcal{D})\) and the true function \(h(\mathbf{x})\) for a given \(\mathbf{x}\) can be broken down:

\[\mathbb{E}_{\mathcal{D}}[\{f(\mathbf{x}; \mathcal{D}) - h(\mathbf{x})\}^2] = \underbrace{\{\mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\}^2}_{\text{(bias)}^2} + \underbrace{\mathbb{E}_{\mathcal{D}}[\{f(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})]\}^2]}_{\text{variance}} \quad (44)\]

  • Bias (Squared): \(\{\mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\}^2\)
    • \(\mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})]\) is the average prediction our model type would make for input \(\mathbf{x}\), if trained on many different datasets.
    • Bias measures how far this average model prediction is from the true function \(h(\mathbf{x})\).
    • High bias: Model is fundamentally “off target”, too simple, or makes systematic errors.
  • Variance: \(\mathbb{E}_{\mathcal{D}}[\{f(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})]\}^2]\)
    • Measures how much our model’s predictions \(f(\mathbf{x}; \mathcal{D})\) scatter or change if we train it on different specific datasets \(\mathcal{D}\).
    • High variance: Model is too sensitive to the training data; it overfits the noise and doesn’t generalize well.

Overall Expected Loss Decomposition

Integrating over all \(\mathbf{x}\), the total expected loss of our model is: \[\text{Expected Loss} = (\text{Bias})^2 + \text{Variance} + \text{Noise} \quad (\text{based on } 45)\]

Where:

  • \((\text{Bias})^2 = \int \{\mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})] - h(\mathbf{x})\}^2 p(\mathbf{x})d\mathbf{x} \quad (46)\)
    • Systematic error from model being too simple.
  • \(\text{Variance} = \int \mathbb{E}_{\mathcal{D}}[\{f(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[f(\mathbf{x}; \mathcal{D})]\}^2] p(\mathbf{x})d\mathbf{x} \quad (47)\)
    • Error from model’s sensitivity to specific training data (overfitting).
  • \(\text{Noise} = \int \text{var}[t|\mathbf{x}] p(\mathbf{x})d\mathbf{x} \quad (\text{var}[t|\mathbf{x}] = \mathbb{E}_t[\{t-h(\mathbf{x})\}^2|\mathbf{x}])\)
    • Irreducible error due to inherent data variability.

Our Challenge: Minimize \((\text{Bias})^2 + \text{Variance}\). There’s often a trade-off: - Very simple models: Low Variance, High Bias. - Very complex models: High Variance, Low Bias.

Visualizing Bias-Variance Trade-off

Panels: (a, b) fit and average fit for large \(\lambda\) (simple model); (c, d) for medium \(\lambda\) (balanced); (e, f) for small \(\lambda\) (complex model).

Figure 7: True function \(h(x)\) (green, for a 1D example). Fits \(f^{(l)}(x)\) from 20 datasets (blue). Average fit \(\bar{f}(x)\) (red).

  • Top Row (Large \(\lambda\)):
    • Blue lines are similar (Low Variance)
    • Average fit (red) is far from true function (green) (High Bias)
    • Model is too simple
  • Middle Row (Medium \(\lambda\)):
    • Balance between bias and variance
    • Variance is controlled, bias not too high
  • Bottom Row (Small \(\lambda\)):
    • Blue lines vary wildly (High Variance)
    • Average fit (red) is closer to true function (green) (Low Bias)
    • Predictions are unstable (overfitting risk)

Quantitative Bias-Variance Trade-off

Figure 8: Plot of squared bias, variance, their sum, and test error versus \(\ln \lambda\) for a 1D example.

  • \(\ln \lambda\) on x-axis: Controls model complexity.
    • Large \(\ln \lambda\) (right): Strong regularization, simpler model.
    • Small \(\ln \lambda\) (left): Weak regularization, more complex model.
  • Observe the Trade-off:
    • Simple models (right): High Bias, Low Variance.
    • Complex models (left): Low Bias, High Variance.
  • The Total Error (Bias² + Variance) has a minimum. This is the “sweet spot” for \(\lambda\) we want to find!
  • Regularization (choosing \(\lambda\)) is key to navigating this trade-off.
  • In practice these quantities can be approximated from \(L\) fitted models \(f^{(l)}\) evaluated at test inputs \(\mathbf{x}_n\): \[\bar{f}(\mathbf{x}_n) = \frac{1}{L}\sum_{l=1}^{L} f^{(l)}(\mathbf{x}_n)\] \[(\text{bias})^2 \approx \frac{1}{N_{\text{test}}}\sum_{n=1}^{N_{\text{test}}} \{\bar{f}(\mathbf{x}_n) - h(\mathbf{x}_n)\}^2\] \[\text{variance} \approx \frac{1}{N_{\text{test}}}\sum_{n=1}^{N_{\text{test}}} \frac{1}{L}\sum_{l=1}^{L} \{f^{(l)}(\mathbf{x}_n) - \bar{f}(\mathbf{x}_n)\}^2\]
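These approximations can be turned into a small experiment. The sketch below (synthetic sinusoidal data, polynomial basis, \(L\) independent datasets; all settings are illustrative) estimates squared bias and variance for weak vs. strong regularization:

```python
import numpy as np

rng = np.random.default_rng(6)

def h(x):
    # True regression function for this 1D example
    return np.sin(2 * np.pi * x)

def fit_ridge(x, t, M, lam):
    # Regularized least squares with polynomial basis phi_j(x) = x^j
    Phi = np.stack([x**j for j in range(M)], axis=1)
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

def predict(w, x):
    Phi = np.stack([x**j for j in range(len(w))], axis=1)
    return Phi @ w

L, N, M = 100, 25, 8
x_test = np.linspace(0.05, 0.95, 50)

def bias2_variance(lam):
    # Fit the same model class on L independently drawn datasets
    preds = []
    for _ in range(L):
        x = rng.uniform(0, 1, N)
        t = h(x) + rng.normal(0, 0.3, N)
        preds.append(predict(fit_ridge(x, t, M, lam), x_test))
    preds = np.array(preds)                    # shape (L, N_test)
    f_bar = preds.mean(axis=0)                 # average fit over datasets
    bias2 = np.mean((f_bar - h(x_test))**2)
    variance = np.mean((preds - f_bar)**2)
    return bias2, variance

b_lo, v_lo = bias2_variance(lam=1e-3)   # weak regularization: complex model
b_hi, v_hi = bias2_variance(lam=10.0)   # strong regularization: simple model
print(b_lo, v_lo, b_hi, v_hi)
```

As in Figure 8, the complex model shows lower bias but higher variance than the heavily regularized one.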

Limitations of Bias-Variance View

  • Practical Value: Hard to calculate exact bias and variance for real problems (needs many datasets or strong assumptions).
  • Conceptual Value: Extremely useful for intuition about model complexity, underfitting, overfitting, and the role of regularization. It’s a powerful mental model.
  • Frequentist Concept: The Bias-Variance decomposition is primarily a frequentist idea. Bayesian approaches handle model complexity and overfitting differently (e.g., through marginalization).

Summary of Single-Layer Regression Networks

  • Linear regression as a simple neural network, predicting \(t\) from input vector \(\mathbf{x}\).
  • Basis functions \(\phi_j(\mathbf{x})\) for nonlinear relationships (model \(y(\mathbf{x},\mathbf{w})\) is linear in parameters \(\mathbf{w}\)).
  • Max likelihood (Gaussian noise) \(\iff\) min sum-of-squares error.
    • Closed-form solution (normal equations for \(\mathbf{w}_{ML}\)).
    • Sequential learning (LMS/SGD).
  • Regularization (L2) controls overfitting by managing model complexity.
  • Decision theory guides how to make optimal predictions (for squared loss, predict the conditional mean \(\mathbb{E}[t|\mathbf{x}]\)).
  • Bias-variance trade-off explains the relationship between model complexity, bias (underfitting), variance (overfitting), and generalization. Regularization helps find a good balance.

Questions & Discussion

  • Thank you!
  • Adapted from the book “Deep Learning” by Bishop & Bishop